This is Hacker Public Radio episode 4,084 for Thursday, the 28th of March 2024. Today's show is entitled "Cloud Learning". It is hosted by Daniel Persson and is about 10 minutes long. It carries a clean flag. The summary is: my experience trying to train a model online.

Hello hackers, and welcome to another episode. Daniel here, and today I'm going to talk about cloud learning: taking a machine learning model and training it in the cloud. This was a topic I went into over the Christmas break because I was fed up with Advent of Code. I couldn't really bear doing more of that, so I needed another topic to look into. So I said, okay, I have this model, whatever it is; in this case it was a TTS model. I wanted to train a voice to speak a particular language and create something that I could use later on. So I needed to find somewhere in the cloud to train this model. I had figured out that training the model on my own computer would take about eight days to run through a full training cycle. But looking online, I could find places where I could train it in eight hours, or ten hours, twelve hours and so on, depending on which graphics cards I was using and how many of them I was running. So I wanted to try running this in the cloud.
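To put those numbers in perspective, here is a quick back-of-the-envelope comparison. The hourly rate is a made-up placeholder, not any vendor's real price:

```python
# Rough comparison: a full training cycle locally vs. on a rented cloud GPU.
# The $2/hour rate below is an invented placeholder for illustration only.

def cloud_cost(hours: float, rate_per_hour: float, gpus: int = 1) -> float:
    """Total price of renting `gpus` cards for `hours` at `rate_per_hour` each."""
    return hours * rate_per_hour * gpus

local_days = 8        # full training cycle on my own machine
cloud_hours = 10      # same cycle on a rented GPU (8-12 h depending on the card)

speedup = (local_days * 24) / cloud_hours
print(f"speedup: {speedup:.1f}x")                              # 19.2x
print(f"cost at $2/h, one GPU: ${cloud_cost(cloud_hours, 2.0):.2f}")
```

The point is simply that even a modest hourly rate can be worth paying when it turns a week-long run into an overnight one.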
So I talked to different cloud vendors. I reached out to Microsoft and looked at their solution. Sadly, after two hours of watching videos, trying to learn their platform, how to set things up, where to go and what tools to use, I gave up, because I still think Microsoft's way of structuring things is not intuitive to me. It's very confusing, so I couldn't really get into it, and I didn't want to spend more time on it. Spending two hours just trying to figure out which tools to use is not fun for me.

So I also tried Google. It was really easy to find my way there, and I figured out that I wanted to use Vertex AI. What I wanted to use was the model that I already had; I wanted to train that. A lot of these cloud providers give you notebooks that you are supposed to put your model into and run it there. But this model was so complex that I needed to check out a Git repository or run a Docker image. In Vertex AI, though, you can run your own Docker images and connect them to Cloud Storage, which was not that complicated, actually. You can do that pretty simply, and I set up something that could train. Then I wanted to run it on some GPU power, and there was the problem, because Google doesn't give you graphics cards unless you ask for them.
So you need to sign up and ask for a graphics card. On Christmas Day, I asked for graphics cards: one card of one type and four cards of an older type. And it took about three to four weeks until they actually gave me access to one of these cards. I haven't been able to run any jobs on that card yet; I'm still trying to figure that out. But just asking for a card and having it take that much time was not a great experience. They said it should take about two to three business days, not four weeks. But still, I got access.

Then I went over and looked at Amazon, because of course I wanted to try all of the big ones. And Amazon, frankly, just said no: you will not get any GPU power in our tooling, you need to use SageMaker. SageMaker is pretty much "use a notebook and train on GPUs in SageMaker". You still need to ask for GPUs, so I could have been declined there as well, but you can't run your custom images. So what I wanted to do, I was not allowed to do. They just said no, which was really sad, so I couldn't use their service either.

Then I found a service that was pay-as-you-go, pretty much GPU as a service. And I'm going to release a video about them.
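As a rough sketch of what that Vertex AI setup looks like from the command line: submitting a custom training job that runs your own Docker image on a GPU. The project, region, image URI and accelerator type below are placeholders for your own setup, and exact flag spellings may vary between gcloud versions:

```shell
# Submit a Vertex AI custom training job backed by our own container image.
# Region, machine type, accelerator and image URI are all example values.
gcloud ai custom-jobs create \
  --region=europe-west4 \
  --display-name=tts-training \
  --worker-pool-spec=machine-type=n1-standard-8,replica-count=1,accelerator-type=NVIDIA_TESLA_T4,accelerator-count=1,container-image-uri=europe-west4-docker.pkg.dev/my-project/tts/train:latest
```

The job reads and writes its data via Cloud Storage; the catch, as described above, is that the accelerator quota for the region has to be granted before a job like this will schedule.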
I'm not sure if I can actually talk about the company yet, but if you're following my channel, you will see that video eventually when we have released it. So I'm talking and working with them around this video. It's a very early company, an early concept, so they haven't really released everything yet. But I think their way of doing it is really interesting, and the right way to do it.

They have two or three different concepts, pretty much. They have an environment, where you say which data center you want to work in; for instance, you can say that you want to run in Norway. Then, when you have set up that environment, you can create volumes where you put your data. So I created a volume of 100 gigabytes where I put all my data, my operating system and so on. And then you can start virtual machines. I started a virtual machine with just CPU power and my volume, and installed all the different dependencies I needed using CPU power, which is very cheap. So it was a cheap way of getting my dependencies, my model and all my data set up and ready to do some training. Then I shut down that CPU-powered machine and took a machine where they had different numbers of graphics cards.
They were running A4000s, A5000s, A6000s, A100s, H100s and L40s. So they had a bunch of these kinds of cards, and machines with one card, two cards, four cards and eight cards. Some of the machines were super powerful, in order to train a lot. Of course it becomes more expensive to run with a lot of cards, but it was still affordable, I would say. So if you're running a load and you really want it done quickly, you just put more GPUs on it, start it up with the volume you have prepared, run your workload, and then shut it down. Perhaps you start a CPU machine again afterwards, in order to download the result. And the best thing about this is that you can either run things directly in the Linux environment of the virtual machine, or go into a VNC host and run your things there, or give the machine its own IP on the internet and log in using SSH, so you have full access to the machine. You can do whatever you want on these machines, and just use GPUs. So it was GPU as a service in its purest form, and I really liked that approach to training things online.

I haven't found any other service that does it similarly and as well. There is of course the option of running it at Linode, which Akamai has bought up now. They have similar solutions, but they're very expensive, from what I've found so far. You could also run things at DigitalOcean; I looked at them, and they are very interesting.
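Since the vendor is unnamed and their tooling is not fully released, the workflow above can only be sketched with an invented pseudo-CLI. Everything here, the `gpucloud` command, its subcommands and flags, is hypothetical and exists purely to illustrate the volume/VM separation described:

```shell
# Hypothetical sketch of the workflow: "gpucloud" and all of its
# subcommands are invented for illustration, not a real tool.

gpucloud volume create --env norway --name tts-data --size 100GB

# Cheap CPU-only VM to install dependencies and stage data on the volume.
gpucloud vm start --env norway --volume tts-data --cpu-only
ssh user@VM_IP 'pip install -r requirements.txt'   # prepare the environment
gpucloud vm stop --env norway

# Swap the same volume onto a GPU machine only for the expensive part.
gpucloud vm start --env norway --volume tts-data --gpus 4
ssh user@VM_IP 'python train.py'                   # run the training load
gpucloud vm stop --env norway                      # stop paying for GPUs
```

The design point is that the persistent volume decouples setup (cheap CPU time) from training (expensive GPU time), so the GPUs are only billed while they are actually working.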
When you start there, you first need to give them five bucks just to sign up, and once you have signed up you can get access to actually run things. But you have to ask for machine power again, and they only accepted running notebooks. So I had to pay them five bucks, and when it actually came down to it, I couldn't use them. So yeah, I will never see those five bucks again, because I can't use them. They said, okay, pay us five bucks and you can use that for your training later on; but because I will never train with them, I pretty much just gave them five bucks.

So this is what I have experienced trying to train machine learning models online using GPUs. Have you tried to do this, and did you have a different experience? Perhaps you have tried Microsoft and found it very easy and could give me some hints; perhaps record an episode explaining how to do it, so I can figure it out myself as well. Or if you have any other experience, please share it with the rest of the community. I'm very interested in this topic. I hope that you liked this episode, and I hope to see you in the next one.

You have been listening to Hacker Public Radio at hackerpublicradio.org. Today's show was contributed by an HPR listener like yourself. If you ever thought of recording a podcast, click on our contribute link to find out how easy it really is. Hosting for HPR has been kindly provided by an honesthost.com, the Internet Archive and rsync.net.
Unless otherwise stated, today's show is released under a Creative Commons Attribution 4.0 International license.